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Abstract 

Recently, considerable research efforts have been devoted to the design 
of methods to learn from data overcomplete dictionaries for sparse coding. 
However, learned dictionaries require the solution of an optimization prob- 
lem for coding new data. In order to overcome this drawback, we propose 
an algorithm aimed at learning both a dictionary and its dual: a linear map- 
ping directly performing the coding. By leveraging on proximal methods, 
our algorithm jointly minimizes the reconstruction error of the dictionary 
and the coding error of its dual; the sparsity of the representation is in- 
duced by an t\ -based penalty on its coefficients. The results obtained on 
synthetic data and real images show that the algorithm is capable of re- 
covering the expected dictionaries. Furthermore, on a benchmark dataset, 
we show that the image features obtained from the dual matrix yield state- 
of-the-art classification performance while being much less computational 
intensive. 
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1 Introduction 



The representation of a signal as the superposition of elementary signals, or 
atoms, is the pillar of a number of research fields and analysis techniques. The 
best-known example of such methods is the Fourier transform, where the atoms 
form an orthonormal basis and every signal has a unique representation. Al- 
though an orthonormal basis would seem the most natural choice for decompos- 
ing a signal, overcomplete dictionaries (or frames) are nowadays commonplace 
and their use is both theoretically justified and supported by experimentally 
successful applications [1|. Tight frames are a class of overcomplete dictionar- 
ies with the particular property of ensuring that the optimal representation can 
still be recovered, as with orthonormal bases, by means of inner products of the 
signal and the dictionary. 

The goal of this paper is to introduce an algorithm - that we called PADDLE - 
capable of learning from data a dictionary endowed with properties similar to 
that of tight frames. Indeed, the proposed method generates both the optimal 
dictionary and its (approximate) dual: a linear operator that decomposes new 
signals to their optimal sparse representations, without the need for solving any 
further optimization problem. We stress that the term dual has been adopted in 
keeping with the terminology used for frames, and does not refer to the concept 
of duality common in optimization. 

Over the years considerable effort has been devoted to the design of methods 
for learning optimal dictionaries from data. Although not yielding overcom- 
plete dictionaries, Principal Component Analysis (PCA) and its derivatives are 
at the root of such approaches, based on the minimization of the error in recon- 
structing the training data as a linear combination of the basis elements. The 
seminal work of Olshausen and Field [2] was the first to propose an algorithm 
for learning an overcomplete dictionary in the field of natural image analysis. 
Probabilistic assumptions on the data led to a cost function made up of a recon- 
struction error and a sparse prior on the coefficients, and the minimization was 
performed by alternating optimizations with respect to the coefficients and to 
the dictionary. Most subsequent methods are based on this alternating scheme 
of optimization, with the main differences being the specific techniques used 
to induce a sparse representation. Recent advances in compressed sensing and 
feature selection led to use £ or l\ penalties on the coefficients, as in Q and 
HIS, respectively. 

In El [3 |8]|, the authors proposed to learn a pair of encoding and decoding 
transformations for efficient representation of natural images. In this case the 
encodings of the training data are dense, and sparsity is introduced by a further 
non-linear transformation between the encoding and decoding modules. Build- 
ing on the idea of directly learning an encoding transformation, we formulated 
the problem in the framework of regularized optimization, by defining an l\ 
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penalized cost functional as in ]4l |5]| , augmented by a coding term. This term 
penalizes the discrepancy between the optimal sparse representations and those 
obtained by the inner products of the dual matrix and the input data. The ad- 
vantage of this approach in terms of performance stands out at evaluation time, 
when coding a new vector only requires a single matrix-vector product. 
The minimization of the proposed functional may be achieved by block coordi- 
nate descent, and we rely on proximal methods to perform the three resulting 
inner optimization problems. Indeed, in recent years different authors provided 
both theoretical and empirical evidence that proximal methods may be used to 
solve the optimization problems underlying many algorithms for l\ -based reg- 
ularization and structured sparsity. A considerable amount of work has been 
devoted to this topic within the context of signal recovery and image process- 
ing. An extensive list of references and an overview of several approaches can be 
found in I|9| I10[|11II12[|13L and fl 1411 in the specific context of machine learning. 
Proximal methods have been used in the context of dictionary learning by [11 511 . 
where the authors are more focused on the combination of dictionary learning 
with the notion of structured sparsity. In particular they introduce a proximal 
operator for a tree-structured regularization that is particularly relevant when 
one is interested in hierarchical models. 

Up to our knowledge, the present work is the first attempt to cast the problem 
of jointly learning a dictionary and its dual in the framework of regularized 
optimization. We show experimentally that PADDLE can recover the expected 
dictionaries and duals, and that codes based on the dual matrix yields state-of- 
the-art classification performance while being much less computational inten- 
sive. 

2 Proximal methods for learning dual dictionaries 

In this section we lay down the problem setting and give a brief overview on 
proximal methods, and how they can be applied to the problem at hand. The 
actual PADDLE algorithm, with more details on its implementation, is described 
in Section |3] 

Let X = [x\, . . . , xn] € R dxN be the matrix whose columns are the training 
vectors. Our goal is to learn a primal dictionary D = [di, . . . , d K ] e R dxK (the 
decoding or synthesis operator), and its dual C = [ci, . . . , ck] t € R Kxd (the en- 
coding or analysis operator), under some optimality conditions that we will de- 
fine in short. The columns of D are the atoms of the dictionary, while the rows 
of C can be seen as filters that are convolved with an input signal x to encode 
it to a vector u e R K . Both the atoms and the filters are constrained to have 
bounded norm to avoid a trivial solution to the problem. 

Let now U = [u%, . . . ,ttjv] € R KxN be the matrix whose columns are the en- 
codings of the training data. We set out to learn both D and its dual C by 
minimizing 

E(D,C 7 U) = -j\\X - DU\\% + ~\\U - CX\\% + IKIli) CD 

i=l 

s.t. HdiilMic,!! 2 < i 
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where t > is a regularization parameter inducing sparsity in U, while 77 > 
weights the coding error with respect to the reconstruction error. 
Since the functional[T]is separately convex in each variable, we proceed by block 
coordinate descent (also known as block nonlinear Gauss-Seidel method) 111 611 . 
iteratively minimizing first with respect to the encoding variables U (sparse cod- 
ing step), and then to the dictionary D and its dual C (dictionary update step). 
Such approach has been proved empirically successful [4], and its convergence 
towards a critical point of E is guaranteed by Corollary 2 of H17H ■ 
The minimization steps both with respect to U and w.r.t. D and C cannot 
be solved explicitly, and therefore we are forced to find approximate solutions 
using an iterative algorithm. In order to do so, we strongly rely on the com- 
mon structure of the three minimization problems, consisting in the sum of a 
convex and differentiable function and a non-smooth convex penalty term or 
constraint. The presence of a non differentiable term makes the solution of 
the minimization problems non trivial. Proximal methods proceed by splitting 
the contribution due to the non-differentiable part, giving easily implementable 
algorithms. 

In summary, a proximal (or forward-backward splitting) algorithm minimizes 
a function of type E(£) — + J(£), where F is convex and differentiable, 
with Lipschitz continuous gradient, while J is lower semicontinuous, convex 
and coercive. These assumptions on F and J, required to ensure the existence 
of a solution, are fairly standard in the optimization literature (see e.g. Ullll ) 
and are always satisfied in the setting of dictionary learning for visual feature 
extraction. The non-smooth term J is involved via its proximity operator, which 
can be seen as a generalized version of a projection operator: 

P(x) = argmin{ J{y) + \\\x- y\\ 2 }. (2) 
y £ 

The proximal algorithm is given by combining the projection step with a forward 
gradient descent step, as follows 

e = p (V 1 - ^v^" 1 )) > ( 3 ) 

The step-size of the inner gradient descent is governed by the coefficient a, 
which can be fixed or adaptive, and whose choice will be discussed in Section 



2.3 In particular, it can be shown that E(^ p ) converges to the minimum of E if 
a is chosen appropriately HI 111 . The convergence of £ p towards a minimizer of 
E is discussed below. 

2. 1 Sparse coding 

With fixed D and C, we can apply algorithm ^ to minimize the functional 

E(U) = F(U) + J(U), where 

1 9 N 

F(U) = -\\X-DU\\ 2 F + ^\\U-CX\\ 2 F and J(U) = -L £ ( 4 ) 

»=i 

The gradient of the (strictly convex) differentiable term F is 

V V F = - 2 ,D T (X - DU) + ^(U - CX), 
a A 
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while the proximity operator corresponding to J is the well-known soft-thresholding 
operator S\ defined component-wise as 



(S x [U])ij = sign(f/ JJ )max{|C/ y | - A,0}. 



Plugging the gradient and the proximal operator into the general equation ([3]), 
we obtain the following update rule: 



U" 



r/Kau 



\ U"- 1 + —( -D T (X - DUP- 1 ) + ^Cx) 
J <ru \d K J 

(5) 



2.2 Dictionary update 

When U is fixed, the optimization problems with respect to D and C are de- 
coupled and can be solved separately. 

The quadratic constraints on the columns of D and the rows of C are equivalent 
to an indicator function J. Denoting by B the unit ball in R d , the constraint on 
D (respectively C) is formalized with J being the indicator function of the set 
of matrices whose columns (resp. rows) belong to B. Due to the fact that J is an 
indicator function, in both cases the proximity operator is a projection operator. 
Denoting by ir(d) = dj max{l, ||d||} the projection on the unit ball in R d , let ttd 
be the operator applying tt to the columns of D and ttc the operator applying tt 
to the rows of C. 

Plugging the appropriate gradients and projection operators into Eq. ((3) leads 
to the update steps 

D p = TToiDP- 1 + -^—{X - D p - 1 J7)[/ T ), (6) 
d(T D 

C p = nciCP- 1 + -J—(U -C p - 1 X)X T ). (7) 



2.3 Gradient descent step 

The choice of the step-sizes au, &d and oc is crucial in achieving fast conver- 
gence. Convergence of each minimization step is discussed in two senses: with 
respect to the functional values, e.g. of E(D P ) — min D E(D), and with respect 
to the minimizing sequences, e.g. of D p . 



Fixed step-size 

In general, for E — F + J, one can choose the step-size to be equal, for all itera- 
tions, to the Lipschitz constant of VF. For VqF and VcF these constants can 
be evaluated explicitly, leading to a D = 2\\UU t \\f/cI and a c = 2\\XX T \\ F /K . 
Such choices ensure linear rates of convergence in the values of E J9) and con- 
vergence of the sequences D p and C p towards a minimizer H11L although no 
convergence rate is available. 

A similar derivation would lead to au = 2|| ^D T D + £l\\, but in this case the 
positive definiteness of the matrix allows us to choose a step-size with faster 
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convergence H18H • Denoting by a and b be the minimum and maximum eigen- 
values of D T D, choosing 

a + b r\ 

will lead to linear convergence rates in the value, as well as linear convergence 
for the sequence U p to the unique minimizer of E(U) H18I . 

FISTA 

Improved convergence rates can be achieved in two ways: either through adap- 
tive step-size choices (Barzilai-Borwein method), or by slightly modifying the 
proximal step as in FISTA J9) . The PADDLE algorithm makes use of the latter 
approach. 

The FISTA update rule is based on evaluating the proximity operator with a 
weighted sum of the previous two iterates. More precisely, defining a\ = 1 and 
(j) 1 = the proximal step ([3]) is replaced by 

C = P(V-^VF(^)V (8) 



(1 + ^l + 4a2)/ 2 (9) 

= + (10) 

Choosing a as in the fixed step-size case, this simple modification allows to 
achieve quadratic convergence rate with respect to the values J9j, which is 
known to be the optimal convergence rate achievable using a first order method 
||19| . Although convergence of the sequences D P ,C P and U p is not proved 
theoretically, there is empirical evidence that it holds in practice [10]. Our 
experiments confirm this observation. 



3 The PADDLE algorithm 

The complete algorithm we propose is outlined in Algorithm [T] As previously 
explained, the algorithm alternates between optimizing with respect to U, D 
and C. These three optimizations are carried out employing the iterative pro- 
jections defined in equations ([5]), ([6]) and ([7]), respectively, adapted according 
to equations ([8-101. After the first iteration of the algorithm, the three inner 



optimizations are initialized with the results obtained at the previous iteration. 
This can be seen as an instance of the popular warm-restart strategy. 
During the iterations it may happen that, after the optimization with respect 
to U, some atoms of D are used only for few reconstructions, or not at all. 
As suggested in 0, if the i-th atom dj is under-used, meaning that only few 
elements of the i-th row of U are non-zero, we can replace it with an example 
that is not reconstructed well (see line [4]). If xj is such an example, this can 
be achieved by simply setting Uj = e i; since at the next step D and C are 
estimated from U and X. In our experiments we only replaced atoms when not 
used at all. 

The iterative procedure is stopped either upon reaching the maximum number 
of iterations, or when the energy decreases only slightly with respect to the last 
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Algorithm 1 PADDLE 

Require: X e M. dxN # input data 

Require: U Q e R KxAr , D° e R dxK , C° e M Kxd # initialization 
Require: rtol, r, 77 > 0, T moa! e Z+, 1< < T ma2; # algorithm parameters 

1: £° =S(D ,C ,(7 ;X,77,r) 

2: for < = 1 fo T TOaa . do 

3: L7* = argmin u £'(£> t ~\ C'^LTjA:, 77,7-), see Eq. (|5]> 
4: possibly replace under-used atoms, see Sec. [3] 
5: D* = argmin^ E(D, C*' 1 ,U t ; X,t],t), see Eq. © 
6: C* = argmin c E{D\C, U*; X, 77, r), see Eq. © 
7: E* = E{D\C t ,U t ;X,^T) 

8: £tf = E t= max{p-iJ ; 0} 

9: // \E l - E H \/E H < rtol then 

10: break 

ii: end if 

12: end for 

13: return U\ D l , C* 



iterations. In our experiments we found out that, in practice, after a few 
hundreds of outer iterations the convergence was always reached. Indeed, in 
many cases only a few tens of iterations were required. 

It is worth noting here, that in our implementation the algorithm optimizes with 
respect to all codes itj simultaneously, see line[3] Although not strictly necessary, 
it is a possibility we opted for, since we confident it could prove advantageous 
in future hardware-accelerated implementations. However, the algorithm can 
be easily implemented with a sequential optimization of the it;. 
A reference implementation in Python, together with scripts for replicating the 
experiments of the following sections, are available online at http : //slipguru . 
disi .unige . it/Research/PADDLE, 

4 Experiments 

The natural application of PADDLE is in the context of learning discriminative 
features for image analysis. Therefore, in the following we report the experi- 
ments on standard datasets of digits and natural images, in order to perform 
qualitative and quantitative assessments of the recovered dictionaries for vari- 
ous choices of the parameters. Furthermore, we discuss the impact of the fea- 
ture vectors obtained from a learned C on the accuracy of an image classifier. 
Preliminarily, we also present a set of experiments on synthetic data, aimed at 
understanding how well the algorithm can recover a dictionary and its dual in 
a more controlled setting. 

4. 1 Synthetic data 

The first synthetic experiment has been performed with K < d = 25 by gen- 
erating N = 10 4 training vectors as linear combinations of 15 random vectors. 
The data have been corrupted by additive Gaussian noise. We have run the 
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Figure 1: Results of the synthetic experiments, (left) Experiment for K < d 
and r = 0. The reconstruction error (solid line) converges to the optimal error 
(dash-dotted line) and the distance between C and D T (dashed black) decades 
with it. (right) Synthetic experiments with a tight frame. From top to bottom, 
we show the original dictionary the recovered dictionary D and recovered dual 
C T . The order in not necessarily the same as in the original, and some of the 
atoms could be inverted. 



algorithm with K = 15, r] = 1 and a small £ 2 regularization term on the coeffi- 
cients U, in order to ensure stability. In this setting we expected the algorithm 
to recover a dictionary close to the basis of the first 15 principal components, as 
well as converging to a dual C close to the transpose D T . In the experiments 
the distance between D and the PCA basis, assessed by the largest principal 
angle between the spanned subspaces, has decreased under 10~ 7 after very few 
iterations. In Figure [T] left image, we also show that the reconstruction error 
converges to the minimum achievable with the first K principal components 
and that the dual C also converges to D T . The distance between D and its 
dual has been computed as the mean value of 1 — |dj • Cj|/(||dj||||cj||) (where 
the Cj are the rows of C). 

In the second experiment we have constructed a tight frame, shown in Figure [T] 
(top, right), and generated N = 10 4 training vectors as random superposition of 
3 frame elements. The data have been again corrupted with additive Gaussian 
noise, and the algorithm run with K = 50 (the same number of frame ele- 
ments), 7] = 1 and r = 0.5. The reconstruction error has reached immediately 
the minimum achievable with the true generating frame, and the distance be- 
tween the C and D T , computed as in the previous experiment, has got smaller 
than 10 -2 . In Figure [l] (right) we show the generating frame and compare it 
with the recovered dictionary D and dual C T . As apparent, most elements are 
present in both dictionaries (possibly inverted). 
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4.2 Benchmark datasets: digits and natural images 



In the next experiments we have applied PADDLE to two sets of images widely 
used as benchmarks in computer vision: the Berkeley segmentation dataset H201 
and the MNIST dataset [21]. The aim of this section is to offer a qualitative and 
quantitative assessment of the dictionaries recovered from these two classes of 
images, with realistic sizes of the training set. 

Berkeley segmentation dataset 

Following the experiment in J6) Sec. 4] , we have extracted a random sample of 
10 5 patches of size 12x12 from the Berkeley segmentation data set [20]. The 
images intensities have been centered by their mean and normalized dividing 
by half the range (125). The patches have been separately recentered too. 
We have run PADDLE over a range of values for r, and both with coding error 
(rj = 1) and without (?/ = 0). The relative tolerance for stopping has been set 
to 10~ 4 . The reconstruction error achieved at the various level of sparsity (r), 
both with and without the coding error, have been constantly lower than the 
reconstruction error achievable with a comparable number of principal compo- 
nents. 

In Figure [2] we show images of the recovered dictionary, together with their 
duals. An interesting effect we report here is that different levels of sparsity in 
the coding coefficients also affects the visual patterns of the dictionary atoms. 
The more sparse is the representation the more the (very few) atoms tends to 
look like simple partial derivatives of a 2D Gaussian kernel, i.e. the dictionary 
tends to adapt poorly to the specific set of data. On the contrary, with a less 
sparse representation, a larger number of the atoms seem to encode for more 
structured local patterns or specific textures present in the dataset. 

MNIST dataset 

Next we have tested the algorithm on the 50, 000 training images of the popu- 
lar MNIST data set [21], which is a collection of 28 x 28 quasi binary images 
of handwritten digits. According to the experiments described in J6l, we have 
trained the dictionary with 200 atoms which is very likely to represent an over- 
complete setting because the true dimensionality of the images is definitely less 
than 784. All the images have been pre-processed by mapping their range into 
the interval [0, 1]. The results obtained are consistent with those already re- 
ported in the literature. In particular, the learned dictionary D comprises the 
most representative digits from which it is possible to reconstruct all the others 
with a low approximation error. In Figure [3] (bottom) we show how the exem- 
plar digit on the left can be expressed in terms of the small subset of the atoms 
in the middle, obtaining the approximate image on the right. As expected, the 
actual number of non-zero coefficients is extremely low if compared with the 
size of the dictionary. In the two rows in the middle, we first report all the dic- 
tionary atoms with non-zero coefficients and then we weight them with respect 
to their relevance in the reconstruction. 

A second, more interesting aspect is the fast empirical convergence of the algo- 
rithm with such a well-structured dataset, as shown in Figure [3] (on the top). 
The initial dictionary has been built with random patches. After the first itera- 
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Figure 2: Decoding (top) and coding (bottom) matrices D and C learned from 
100, 000 patches of size 12 x 12 randomly sampled from the Berkley segmen- 
tation dataset. The dictionaries have been learned with different values of r: 
from left to right, 1, 0.1 and 0.02. The atoms are in column-major order from the 
most used to the least one. The codes corresponding to the three dictionaries 
pairs uses on average 6.6, 65.9 and 139.8 atoms, respectively. 



Initialization After 1 iteration After 4 iterations At convergence 




> 3 4 : • / 



Figure 3: Experiments on MNIST dataset. See section 4.2 for details. 



tion it was already possible to inspect some digits, and the amount of change 
decreased rapidly reaching a substantial convergence after only a few iterations. 
The dictionary after 20 iterations (corresponding to the full convergence) was 
almost identical to the one after 4 iterations only. 
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4.3 Classification 



In this last group of experiments we have focused on the impact of using the 
dictionaries learned with the PADDLE algorithm in a classification context. More 
specifically, we have investigated the discriminative power of the sparse coding 
associated to the dictionary D and its dual C when used to represent the visual 
content of an image. The goal of the experiments has been to build a classifier 
to assign each image to a specific semantic class. In practice, we replicated 
the experimental setting of H22H . The classification results, reported in Table [T] 
show that the performance we have obtained using a representation computed 
with PADDLE is essentially the same - i.e. the results are not distinguishable 
within one standard deviation - as the one obtained with learned dictionary 
used by the authors in the original paper. 



Encodings 


D 


C 


[22] 


Mean accuracy 
Standard deviation 


0.987 
0.008 


0.984 
0.008 


0.985 
0.008 



Table 1: Classification performances obtained on a subset of the popular Cal- 
techlOl dataset comprising two classes. 

The results obtained with the dictionary C are especially encouraging if one 
consider the substantial gain in the computational time required to compute the 
sparse codes, with a fixed dictionary for each new input image. In our experi- 
ments we have been able to process each image in less than 0.21 seconds, while 
it took 2.3 seconds per image, on the same machine, to use the implementation 
of the feature-sign search algorithm J4) provided with the software package Sc- 
SPM (available at http://www.ifp.illinois.edu/ jyang29/ScSPM.htm). In- 
deed, regardless the specific implementation of the sparse optimization method, 
it is easy to see that using C is always the best choice since it requires just one 
matrix-vector multiplication. 

5 Conclusion 

We have proposed a novel algorithm based on proximal methods to learn a dic- 
tionary and its dual, that can be used to compute sparse overcomplete represen- 
tations of data. The experiments have shown that for image data the algorithm 
yields representations with good discriminative power. In particular, the dual 
dictionary can be used to efficiently compute the representations by means of a 
simple matrix-vector multiplication, without any loss of classification accuracy. 
We believe that our method is a valid contribution towards building robust and 
expressive dictionaries of visual features. 
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